Visual question answering: A survey of methods and datasets
نویسندگان
چکیده
Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and the natural language processing communities. Given an image and a question in natural language, it requires reasoning over visual elements of the image and general knowledge to infer the correct answer. In the first part of this survey, we examine the state of the art by comparing modern approaches to the problem. We classify methods by their mechanism to connect the visual and textual modalities. In particular, we examine the common approach of combining convolutional and recurrent neural networks to map images and questions to a common feature space. We also discuss memory-augmented and modular architectures that interface with structured knowledge bases. In the second part of this survey, we review the datasets available for training and evaluating VQA systems. The various datatsets contain questions at different levels of complexity, which require different capabilities and types of reasoning. We examine in depth the question/answer pairs from the Visual Genome project, and evaluate the relevance of the structured annotations of images with scene graphs for VQA. Finally, we discuss promising future directions for the field, in particular the connection to structured knowledge bases and the use of natural language processing models.
منابع مشابه
Survey of Visual Question Answering: Datasets and Techniques
Visual question answering (or VQA) is a new and exciting problem that combines natural language processing and computer vision techniques. We present a survey of the various datasets and models that have been used to tackle this task. The first part of this survey details the various datasets for VQA and compares them along some common factors. The second part of this survey details the differe...
متن کاملA Survey of Datasets for Biomedical Question Answering Systems
The massively ever increasing amount of textual and linked biomedical data available online poses many challenges for information seekers. So, the focus of information retrieval community has shifted to precise information retrieval, i.e. providing exact answer to a user question. In recent years, many datasets related to Biomedical Question Answering (BioQA) have emerged which the researchers ...
متن کاملSurvey of Recent Advances in Visual Question Answering
Visual Question Answering (VQA) presents a unique challenge as it requires the ability to understand and encode the multi-modal inputs in terms of image processing and natural language processing. The algorithm further needs to learn how to perform reasoning over this multi-modal representation so it can answer the questions correctly. This paper presents a survey of different approaches propos...
متن کاملInvestigating Embedded Question Reuse in Question Answering
The investigation presented in this paper is a novel method in question answering (QA) that enables a QA system to gain performance through reuse of information in the answer to one question to answer another related question. Our analysis shows that a pair of question in a general open domain QA can have embedding relation through their mentions of noun phrase expressions. We present methods f...
متن کاملMultimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
Modeling textual or visual information with vector representations trained from large language or visual datasets has been successfully explored in recent years. However, tasks such as visual question answering require combining these vector representations with each other. Approaches to multimodal pooling include element-wise multiplication or addition, as well as concatenation of the visual a...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Computer Vision and Image Understanding
دوره 163 شماره
صفحات -
تاریخ انتشار 2017